Abstract

We introduce Adam, an algorithm for first-order gradient-based optimization of
stochastic objective functions, based on adaptive estimates of lower-order moments.
The method is straightforward to implement, is computationally efficient,
has low memory requirements, is invariant to diagonal rescaling of the gradients,
and is well suited for problems that are large in terms of data and/or parameters.
The method is also appropriate for non-stationary objectives and problems with
very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations
and typically require little tuning. Some connections to related algorithms,
by which Adam was inspired, are discussed. We also analyze the theoretical convergence
properties of the algorithm and provide a regret bound on the convergence
rate that is comparable to the best known results under the online convex
optimization framework. Empirical results demonstrate that Adam works well in
practice and compares favorably to other stochastic optimization methods. Finally,
we discuss AdaMax, a variant of Adam based on the infinity norm.


1 Introduction 

Stochastic gradient-based optimization is of core practical importance in many fields of science and
engineering. Many problems in these fields can be cast as the optimization of some scalar parameterized
objective function requiring maximization or minimization with respect to its parameters. If the
function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization
method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same
computational complexity as just evaluating the function. Often, objective functions are stochastic.
For example, many objective functions are composed of a sum of subfunctions evaluated at different
subsamples of data; in this case optimization can be made more efficient by taking gradient steps
w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD has proved itself
an efficient and effective optimization method that was central in many machine learning success
stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton
& Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013). Objectives may also have other
sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization. For
all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this
paper is on the optimization of stochastic objectives with high-dimensional parameter spaces. In
these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be
restricted to first-order methods.

We propose Adam, a method for efficient stochastic optimization that only requires first-order
gradients and has a small memory requirement. The method computes individual adaptive learning rates for
different parameters from estimates of first and second moments of the gradients; the name Adam
is derived from adaptive moment estimation. Our method is designed to combine the advantages
of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse
gradients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary
settings; important connections to these and other stochastic optimization methods are clarified in
section 5. Some of Adam's advantages are that the magnitudes of parameter updates are invariant to
rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter,
it does not require a stationary objective, it works with sparse gradients, and it naturally performs a
form of step size annealing.

* Equal contribution. Author ordering determined by coin flip over a Google Hangout. 
Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details,
and for a slightly more efficient (but less clear) order of computation. $g_t^2$ indicates the elementwise
square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$,
$\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$
we denote $\beta_1$ and $\beta_2$ to the power $t$.

Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates for the moment estimates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector
  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $v_0 \leftarrow 0$ (Initialize 2nd moment vector)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (Update biased second raw moment estimate)
    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
    $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)
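As an illustration, Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal sketch rather than a reference implementation: the function name `adam`, the `grad_fn` callback and the fixed iteration budget `num_steps` (standing in for the "not converged" test) are assumptions made for this example.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=1000):
    """Run `num_steps` Adam updates; grad_fn(theta) returns a (stochastic) gradient.

    Hypothetical helper for illustration: a fixed step budget replaces the
    convergence test of Algorithm 1.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # biased first moment estimate
    v = np.zeros_like(theta)  # biased second raw moment estimate
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g   # elementwise square of g
        m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

if __name__ == "__main__":
    # Toy deterministic objective 0.5 * ||theta - target||^2 (an assumption).
    target = np.array([1.0, -2.0])
    print(adam(lambda th: th - target, np.zeros(2), alpha=0.01, num_steps=5000))
```

Note that on the very first step $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each coordinate moves by approximately $\alpha$ in the direction opposite to the sign of its gradient.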


In section 2 we describe the algorithm and the properties of its update rule. Section 3 explains
our initialization bias correction technique, and section 4 provides a theoretical analysis of Adam’s 
convergence in online convex programming. Empirically, our method consistently outperforms other 
methods for a variety of models and datasets, as shown in section 6. Overall, we show that Adam is 
a versatile algorithm that scales to large-scale high-dimensional machine learning problems. 

2 Algorithm 

See algorithm 1 for pseudo-code of our proposed algorithm Adam. Let $f(\theta)$ be a noisy objective
function: a stochastic scalar function that is differentiable w.r.t. parameters $\theta$. We are
interested in minimizing the expected value of this function, $\mathbb{E}[f(\theta)]$, w.r.t. its parameters $\theta$. With
$f_1(\theta), \ldots, f_T(\theta)$ we denote the realisations of the stochastic function at subsequent timesteps
$1, \ldots, T$. The stochasticity might come from the evaluation at random subsamples (minibatches)
of datapoints, or arise from inherent function noise. With $g_t = \nabla_\theta f_t(\theta)$ we denote the gradient, i.e.
the vector of partial derivatives of $f_t$ w.r.t. $\theta$, evaluated at timestep $t$.

The algorithm updates exponential moving averages of the gradient ($m_t$) and the squared gradient
($v_t$) where the hyper-parameters $\beta_1, \beta_2 \in [0, 1)$ control the exponential decay rates of these moving
averages. The moving averages themselves are estimates of the 1st moment (the mean) and the
2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are
initialized as (vectors of) 0's, leading to moment estimates that are biased towards zero, especially
during the initial timesteps, and especially when the decay rates are small (i.e. the $\beta$s are close to 1).
The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected
estimates $\hat{m}_t$ and $\hat{v}_t$. See section 3 for more details.

Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing 
the order of computation, e.g. by replacing the last three lines in the loop with the following lines: 

$\alpha_t = \alpha \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$ and $\theta_t \leftarrow \theta_{t-1} - \alpha_t \cdot m_t / (\sqrt{v_t} + \epsilon)$.
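The reordering can be checked numerically. The sketch below assumes $\epsilon = 0$, in which case the two forms are algebraically identical (with a nonzero $\epsilon$ they differ slightly, because $\epsilon$ is then added to $\sqrt{v_t}$ rather than to $\sqrt{\hat{v}_t}$). The function names `step_clear` and `step_efficient` are hypothetical and used only for this illustration.

```python
import numpy as np

def step_clear(theta, m, v, t, alpha, beta1, beta2):
    """Update written as in the loop of Algorithm 1, with epsilon = 0."""
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / np.sqrt(v_hat)

def step_efficient(theta, m, v, t, alpha, beta1, beta2):
    """Same update with both bias corrections folded into one scalar alpha_t."""
    alpha_t = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return theta - alpha_t * m / np.sqrt(v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=4)
    m = rng.normal(size=4)             # arbitrary first moment state
    v = rng.uniform(0.1, 1.0, size=4)  # arbitrary positive second moment state
    a = step_clear(theta, m, v, 7, 0.001, 0.9, 0.999)
    b = step_efficient(theta, m, v, 7, 0.001, 0.9, 0.999)
    print(np.allclose(a, b))
```

The efficient form replaces two vector divisions per step by a single scalar computation of $\alpha_t$.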

2.1 Adam’s update rule 

An important property of Adam's update rule is its careful choice of stepsizes. Assuming $\epsilon = 0$, the
effective step taken in parameter space at timestep $t$ is $\Delta_t = \alpha \cdot \hat{m}_t / \sqrt{\hat{v}_t}$. The effective stepsize has
two upper bounds: $|\Delta_t| \le \alpha \cdot (1 - \beta_1) / \sqrt{1 - \beta_2}$ in the case $(1 - \beta_1) > \sqrt{1 - \beta_2}$, and $|\Delta_t| \le \alpha$
otherwise. The first case only happens in the most severe case of sparsity: when a gradient has
been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize
will be smaller. When $(1 - \beta_1) = \sqrt{1 - \beta_2}$ we have that $|\hat{m}_t / \sqrt{\hat{v}_t}| < 1$, therefore $|\Delta_t| < \alpha$. In
more common scenarios, we will have that $\hat{m}_t / \sqrt{\hat{v}_t} \approx \pm 1$ since $|\mathbb{E}[g] / \sqrt{\mathbb{E}[g^2]}| \le 1$. The effective
magnitude of the steps taken in parameter space at each timestep is approximately bounded by
the stepsize setting $\alpha$, i.e., $|\Delta_t| \lessapprox \alpha$. This can be understood as establishing a trust region around
the current parameter value, beyond which the current gradient estimate does not provide sufficient
information. This typically makes it relatively easy to know the right scale of $\alpha$ in advance. For
many machine learning models, for instance, we often know in advance that good optima are with
high probability within some set region in parameter space; it is not uncommon, for example, to
have a prior distribution over the parameters. Since $\alpha$ sets (an upper bound of) the magnitude of
steps in parameter space, we can often deduce the right order of magnitude of $\alpha$ such that optima
can be reached from $\theta_0$ within some number of iterations. With a slight abuse of terminology,
we will call the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ the signal-to-noise ratio (SNR). With a smaller SNR the effective
stepsize $\Delta_t$ will be closer to zero. This is a desirable property, since a smaller SNR means that
there is greater uncertainty about whether the direction of $\hat{m}_t$ corresponds to the direction of the true
gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading
to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize
$\Delta_t$ is also invariant to the scale of the gradients; rescaling the gradients $g$ with factor $c$ will scale $\hat{m}_t$
with a factor $c$ and $\hat{v}_t$ with a factor $c^2$, which cancel out: $(c \cdot \hat{m}_t) / (\sqrt{c^2 \cdot \hat{v}_t}) = \hat{m}_t / \sqrt{\hat{v}_t}$.
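These properties of the effective step can be explored with a small numerical sketch for a single coordinate. It assumes $\epsilon = 0$ as in the text; the helper name `effective_step` and the three toy gradient sequences are assumptions chosen for illustration.

```python
import numpy as np

def effective_step(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """Return alpha * m_hat_T / sqrt(v_hat_T) for a 1-D gradient sequence (epsilon = 0)."""
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    t = len(grads)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return alpha * m_hat / np.sqrt(v_hat)

if __name__ == "__main__":
    alpha, beta1, beta2 = 0.001, 0.9, 0.999
    # Most severe sparsity: zero gradient everywhere except the current timestep.
    sparse = np.zeros(20000)
    sparse[-1] = 1.0
    print(effective_step(sparse), alpha * (1 - beta1) / np.sqrt(1 - beta2))
    # Constant gradient: the step has magnitude alpha.
    print(effective_step(np.full(500, 5.0)))
    # Rescaling all gradients by c leaves the step unchanged.
    g = np.random.default_rng(0).normal(size=300)
    print(effective_step(g), effective_step(1000.0 * g))
```

With the default settings $(1 - \beta_1) > \sqrt{1 - \beta_2}$, so the sparse sequence approaches the first upper bound, which is larger than $\alpha$, while the constant sequence gives a step of magnitude $\alpha$.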

3 Initialization bias correction 

As explained in section 2, Adam utilizes initialization bias correction terms. We will here derive 
the term for the second moment estimate; the derivation for the first moment estimate is completely 
analogous. Let $g$ be the gradient of the stochastic objective $f$, and we wish to estimate its second
raw moment (uncentered variance) using an exponential moving average of the squared gradient,
with decay rate $\beta_2$. Let $g_1, \ldots, g_T$ be the gradients at subsequent timesteps, each a draw from an
underlying gradient distribution $g_t \sim p(g_t)$. Let us initialize the exponential moving average as
$v_0 = 0$ (a vector of zeros). First note that the update at timestep $t$ of the exponential moving average
$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (where $g_t^2$ indicates the elementwise square $g_t \odot g_t$) can be written as
a function of the gradients at all previous timesteps:

$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2 \tag{1}$$


We wish to know how $\mathbb{E}[v_t]$, the expected value of the exponential moving average at timestep $t$,
relates to the true second moment $\mathbb{E}[g_t^2]$, so we can correct for the discrepancy between the two.
Taking expectations of the left-hand and right-hand sides of eq. (1):

$$\mathbb{E}[v_t] = \mathbb{E}\left[(1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2\right] \tag{2}$$

$$= \mathbb{E}[g_t^2] \cdot (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} + \zeta \tag{3}$$

$$= \mathbb{E}[g_t^2] \cdot (1 - \beta_2^t) + \zeta \tag{4}$$


where $\zeta = 0$ if the true second moment $\mathbb{E}[g_i^2]$ is stationary; otherwise $\zeta$ can be kept small since
the exponential decay rate $\beta_2$ can (and should) be chosen such that the exponential moving average
assigns small weights to gradients too far in the past. What is left is the term $(1 - \beta_2^t)$ which is
caused by initializing the running average with zeros. In algorithm 1 we therefore divide by this
term to correct the initialization bias.

In case of sparse gradients, for a reliable estimate of the second moment one needs to average over
many gradients by choosing a slow decay (i.e. $\beta_2$ close to 1); however it is exactly this case of slow
decay where a lack of initialisation bias correction would lead to initial steps that are much larger.
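The effect of the correction term can be illustrated numerically. The sketch below assumes a stationary, noise-free gradient so that $\zeta = 0$ and eq. (4) holds exactly; the function name `second_moment_estimates` is hypothetical.

```python
import numpy as np

def second_moment_estimates(grads, beta2=0.999):
    """Return (biased v_t, bias-corrected v_t / (1 - beta2**t)) for a 1-D gradient sequence."""
    grads = np.asarray(grads, dtype=float)
    v = 0.0
    biased, corrected = [], []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        biased.append(v)
        corrected.append(v / (1 - beta2 ** t))
    return np.array(biased), np.array(corrected)

if __name__ == "__main__":
    # Constant gradient g = 2, so the true second moment is 4 at every step.
    biased, corrected = second_moment_estimates(np.full(10, 2.0))
    print(biased[:3])     # far below 4 during the first steps
    print(corrected[:3])  # approximately 4 from the first step onwards
```

In this toy setting the uncorrected estimate after one step is about $(1 - \beta_2) \cdot 4 = 0.004$ instead of 4, so dividing by its square root without correction would inflate the first step by a factor of roughly $1/\sqrt{1 - \beta_2} \approx 31.6$.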
4 Convergence analysis


We analyze the convergence of Adam using the online learning framework proposed in (Zinkevich,
2003). We are given an arbitrary, unknown sequence of convex cost functions $f_1(\theta), f_2(\theta), \ldots, f_T(\theta)$. At
each time $t$, our goal is to predict the parameter $\theta_t$ and evaluate it on a previously unknown cost
function $f_t$. Since the nature of the sequence is unknown in advance, we evaluate our algorithm
using the regret, that is the sum of all the previous differences between the online prediction $f_t(\theta_t)$
and the best fixed point parameter $f_t(\theta^*)$ from a feasible set $\mathcal{X}$ for all the previous steps. Concretely,
the regret is defined as:

$$R(T) = \sum_{t=1}^{T} \left[f_t(\theta_t) - f_t(\theta^*)\right] \tag{5}$$

where $\theta^* = \arg\min_{\theta \in \mathcal{X}} \sum_{t=1}^{T} f_t(\theta)$. We show Adam has an $O(\sqrt{T})$ regret bound and a proof is given
in the appendix. Our result is comparable to the best known bound for this general convex online
learning problem. We also use some definitions to simplify our notation, where $g_t \triangleq \nabla f_t(\theta_t)$ and $g_{t,i}$
is the $i$th element. We define $g_{1:t,i} \in \mathbb{R}^t$ as a vector that contains the $i$th dimension of the gradients
over all iterations till $t$, $g_{1:t,i} = [g_{1,i}, g_{2,i}, \cdots, g_{t,i}]$. Also, we define $\gamma \triangleq \frac{\beta_1^2}{\sqrt{\beta_2}}$. Our following
theorem holds when the learning rate $\alpha_t$ is decaying at a rate of $t^{-1/2}$ and the first moment running
average coefficient $\beta_{1,t}$ decays exponentially with $\lambda$, that is typically close to 1, e.g. $1 - 10^{-8}$.

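Theorem 4.1. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$
for all $\theta \in \mathbb{R}^d$ and distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$,
$\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \ldots, T\}$, and $\beta_1, \beta_2 \in [0, 1)$ satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$. Let $\alpha_t = \frac{\alpha}{\sqrt{t}}$
and $\beta_{1,t} = \beta_1 \lambda^{t-1}$, $\lambda \in (0, 1)$. Adam achieves the following guarantee, for all $T \ge 1$.

$$R(T) \le \frac{D^2}{2\alpha(1 - \beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha(1 + \beta_1) G_\infty}{(1 - \beta_1)\sqrt{1 - \beta_2}(1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2\alpha(1 - \beta_1)(1 - \lambda)^2}$$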


Our Theorem 4.1 implies that when the data features are sparse and the gradients are bounded, the
summation terms can be much smaller than their upper bounds, $\sum_{i=1}^{d} \|g_{1:T,i}\|_2 \ll d G_\infty \sqrt{T}$ and
$\sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} \ll d G_\infty \sqrt{T}$, in particular if the class of function and data features are in the form of
section 1.2 in (Duchi et al., 2011). Their results for the expected value $\mathbb{E}[\sum_{i=1}^{d} \|g_{1:T,i}\|_2]$ also apply
to Adam. In particular, the adaptive methods, such as Adam and Adagrad, can achieve $O(\log d \sqrt{T})$,
an improvement over $O(\sqrt{dT})$ for the non-adaptive method. Decaying $\beta_{1,t}$ towards zero is important
in our theoretical analysis and also matches previous empirical findings, e.g. (Sutskever et al.,
2013) suggests reducing the momentum coefficient at the end of training can improve convergence.


Finally, we can show the average regret of Adam converges,

Corollary 4.2. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$
for all $\theta \in \mathbb{R}^d$ and distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$,
$\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \ldots, T\}$. Adam achieves the following guarantee, for all
$T \ge 1$.

$$\frac{R(T)}{T} = O\left(\frac{1}{\sqrt{T}}\right)$$

This result can be obtained by using Theorem 4.1 and $\sum_{i=1}^{d} \|g_{1:T,i}\|_2 \le d G_\infty \sqrt{T}$. Thus,
$\lim_{T \to \infty} \frac{R(T)}{T} = 0$.


5 Related work 

Optimization methods bearing a direct relation to Adam are RMSProp (Tieleman & Hinton, 2012; 
Graves, 2013) and AdaGrad (Duchi et al., 2011); these relationships are discussed below. Other 
stochastic optimization methods include vSGD (Schaul et al., 2012), AdaDelta (Zeiler, 2012) and the 
natural Newton method from Roux & Fitzgibbon (2010), all setting stepsizes by estimating curvature 
from first-order information. The Sum-of-Functions Optimizer (SFO) (Sohl-Dickstein et al., 2014)
is a quasi-Newton method based on minibatches, but (unlike Adam) has memory requirements linear
in the number of minibatch partitions of a dataset, which is often infeasible on memory-constrained
systems such as a GPU. Like natural gradient descent (NGD) (Amari, 1998), Adam employs a
preconditioner that adapts to the geometry of the data, since $\hat{v}_t$ is an approximation to the diagonal
of the Fisher information matrix (Pascanu & Bengio, 2013); however, Adam's preconditioner (like
AdaGrad's) is more conservative in its adaptation than vanilla NGD by preconditioning with the square
root of the inverse of the diagonal Fisher information matrix approximation.

RMSProp: An optimization method closely related to Adam is RMSProp (Tieleman & Hinton,
2012). A version with momentum has sometimes been used (Graves, 2013). There are a few important
differences between RMSProp with momentum and Adam: RMSProp with momentum generates
its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are
directly estimated using a running average of first and second moment of the gradient. RMSProp
also lacks a bias-correction term; this matters most in case of a value of $\beta_2$ close to 1 (required in
case of sparse gradients), since in that case not correcting the bias leads to very large stepsizes and
often divergence, as we also empirically demonstrate in section 6.4.


AdaGrad: An algorithm that works well for sparse gradients is AdaGrad (Duchi et al., 2011). Its
basic version updates parameters as $\theta_{t+1} = \theta_t - \alpha \cdot g_t / \sqrt{\sum_{i=1}^{t} g_i^2}$. Note that if we choose $\beta_2$ to be
infinitesimally close to 1 from below, then $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1} \cdot \sum_{i=1}^{t} g_i^2$. AdaGrad corresponds to a
version of Adam with $\beta_1 = 0$, infinitesimal $(1 - \beta_2)$ and a replacement of $\alpha$ by an annealed version
$\alpha_t = \alpha \cdot t^{-1/2}$, namely

$$\theta_t - \alpha \cdot t^{-1/2} \cdot \hat{m}_t / \sqrt{\lim_{\beta_2 \to 1} \hat{v}_t} = \theta_t - \alpha \cdot t^{-1/2} \cdot g_t / \sqrt{t^{-1} \cdot \sum_{i=1}^{t} g_i^2} = \theta_t - \alpha \cdot g_t / \sqrt{\sum_{i=1}^{t} g_i^2}.$$

Note that this direct correspondence between Adam and Adagrad does
not hold when removing the bias-correction terms; without bias correction, like in RMSProp, a $\beta_2$
infinitesimally close to 1 would lead to infinitely large bias, and infinitely large parameter updates.
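The correspondence with AdaGrad can be checked numerically for a single coordinate. The sketch below is an illustration under stated assumptions: $\beta_2 = 1 - 10^{-9}$ stands in for "infinitesimally close to 1", $\epsilon = 0$, and the function names `adagrad_steps` and `adam_limit_steps` are hypothetical.

```python
import numpy as np

def adagrad_steps(grads, alpha):
    """Basic AdaGrad updates alpha * g_t / sqrt(sum_{i<=t} g_i^2) for a 1-D sequence."""
    grads = np.asarray(grads, dtype=float)
    return alpha * grads / np.sqrt(np.cumsum(grads ** 2))

def adam_limit_steps(grads, alpha, beta2=1 - 1e-9):
    """Adam updates with beta1 = 0, beta2 near 1, epsilon = 0 and alpha_t = alpha / sqrt(t)."""
    grads = np.asarray(grads, dtype=float)
    v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = v / (1 - beta2 ** t)  # bias correction is essential here
        # With beta1 = 0 the bias-corrected first moment equals g_t.
        steps.append(alpha / np.sqrt(t) * g / np.sqrt(v_hat))
    return np.array(steps)

if __name__ == "__main__":
    g = np.random.default_rng(0).normal(size=100)
    print(np.allclose(adagrad_steps(g, 0.01), adam_limit_steps(g, 0.01), rtol=1e-4))
```

Dropping the division by $(1 - \beta_2^t)$ in `adam_limit_steps` makes $v_t$ tiny and the steps correspondingly huge, which is the failure mode described above for $\beta_2$ close to 1 without bias correction.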


6 Experiments 

To empirically evaluate the proposed method, we investigated different popular machine learning
models, including logistic regression, multilayer fully connected neural networks and deep convolutional
neural networks. Using large models and datasets, we demonstrate Adam can efficiently solve
practical deep learning problems.

We use the same parameter initialization when comparing different optimization algorithms. The 
hyper-parameters, such as learning rate and momentum, are searched over a dense grid and the 
results are reported using the best hyper-parameter setting. 

6.1 Experiment: Logistic Regression 

We evaluate our proposed method on $L_2$-regularized multi-class logistic regression using the MNIST
dataset. Logistic regression has a well-studied convex objective, making it suitable for comparison
of different optimizers without worrying about local minimum issues. The stepsize $\alpha$ in our logistic
regression experiments is adjusted by $1/\sqrt{t}$ decay, namely $\alpha_t = \frac{\alpha}{\sqrt{t}}$, which matches our theoretical
prediction from section 4. The logistic regression classifies the class label directly on the 784
dimension image vectors. We compare Adam to accelerated SGD with Nesterov momentum and
Adagrad using a minibatch size of 128. According to Figure 1, we found that Adam yields similar
convergence as SGD with momentum and both converge faster than Adagrad.

As discussed in (Duchi et al., 2011), Adagrad can efficiently deal with sparse features and gradients
as one of its main theoretical results whereas SGD is slow at learning rare features. Adam with
$1/\sqrt{t}$ decay on its stepsize should theoretically match the performance of Adagrad. We examine the
sparse feature problem using the IMDB movie review dataset from (Maas et al., 2011). We pre-process
the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most
frequent words. The 10,000 dimension BoW feature vector for each review is highly sparse. As suggested
in (Wang & Manning, 2013), 50% dropout noise can be applied to the BoW features during
training to prevent over-fitting. In figure 1, Adagrad outperforms SGD with Nesterov momentum
by a large margin both with and without dropout noise. Adam converges as fast as Adagrad. The
empirical performance of Adam is consistent with our theoretical findings in sections 2 and 4. Similar
to Adagrad, Adam can take advantage of sparse features and obtain a faster convergence rate than
normal SGD with momentum.

Figure 1: Logistic regression training negative log likelihood on MNIST images and IMDB movie
reviews with 10,000 bag-of-words (BoW) feature vectors.

6.2 Experiment: Multi-layer Neural Networks 

Multi-layer neural networks are powerful models with non-convex objective functions. Although
our convergence analysis does not apply to non-convex problems, we empirically found that Adam
often outperforms other methods in such cases. In our experiments, we made model choices that are
consistent with previous publications in the area; a neural network model with two fully connected
hidden layers with 1000 hidden units each and ReLU activation is used for this experiment with
a minibatch size of 128.

First, we study different optimizers using the standard deterministic cross-entropy objective function
with $L_2$ weight decay on the parameters to prevent over-fitting. The sum-of-functions (SFO)
method (Sohl-Dickstein et al., 2014) is a recently proposed quasi-Newton method that works with
minibatches of data and has shown good performance on optimization of multi-layer neural networks.
We used their implementation and compared with Adam to train such models. Figure 2
shows that Adam makes faster progress in terms of both the number of iterations and wall-clock
time. Due to the cost of updating curvature information, SFO is 5-10x slower per iteration compared
to Adam, and has a memory requirement that is linear in the number of minibatches.

Stochastic regularization methods, such as dropout, are an effective way to prevent over-fitting and 
often used in practice due to their simplicity. SFO assumes deterministic subfunctions, and indeed 
failed to converge on cost functions with stochastic regularization. We compare the effectiveness of 
Adam to other stochastic first order methods on multi-layer neural networks trained with dropout 
noise. Figure 2 shows our results; Adam shows better convergence than other methods. 

6.3 Experiment: Convolutional Neural Networks 

Convolutional neural networks (CNNs) with several layers of convolution, pooling and non-linear 
units have shown considerable success in computer vision tasks. Unlike most fully connected neural 
nets, weight sharing in CNNs results in vastly different gradients in different layers. A smaller 
learning rate for the convolution layers is often used in practice when applying SGD. We show the 
effectiveness of Adam in deep CNNs. Our CNN architecture has three alternating stages of 5x5 
convolution filters and 3x3 max pooling with stride of 2 that are followed by a fully connected layer 
of 1000 rectified linear hidden units (ReLU's). The input images are pre-processed by whitening, and
dropout noise is applied to the input layer and fully connected layer. The minibatch size is also set
to 128 similar to previous experiments.

Figure 2: Training of multilayer neural networks on MNIST images. (a) Neural networks using
dropout stochastic regularization. (b) Neural networks with deterministic cost function. We compare
with the sum-of-functions (SFO) optimizer (Sohl-Dickstein et al., 2014).

Figure 3: Convolutional neural networks training cost (panel title: CIFAR10 ConvNet). (left) Training
cost for the first three epochs. (right) Training cost over 45 epochs. CIFAR-10 with c64-c64-c128-1000
architecture.

Interestingly, although both Adam and Adagrad make rapid progress lowering the cost in the initial
stage of the training, shown in Figure 3 (left), Adam and SGD eventually converge considerably
faster than Adagrad for CNNs, as shown in Figure 3 (right). We notice the second moment estimate $\hat{v}_t$
vanishes to zeros after a few epochs and is dominated by the $\epsilon$ in algorithm 1. The second moment
estimate is therefore a poor approximation to the geometry of the cost function in CNNs compared
to the fully connected network from Section 6.2. In contrast, reducing the minibatch variance through
the first moment is more important in CNNs and contributes to the speed-up. As a result, Adagrad
converges much slower than the others in this particular experiment. Though Adam shows marginal
improvement over SGD with momentum, it adapts the learning rate scale for different layers instead of
requiring it to be hand-picked as in SGD.
Figure 4: Effect of bias-correction terms (red line) versus no bias correction terms (green line)
after 10 epochs (left) and 100 epochs (right) on the loss (y-axes) when learning a Variational Auto-Encoder
(VAE) (Kingma & Welling, 2013), for different settings of stepsize $\alpha$ (x-axes) and hyper-parameters
$\beta_1$ and $\beta_2$ (panel titles: $\beta_2$ = 0.99, 0.999, 0.9999).


6.4 Experiment: bias-correction term 

We also empirically evaluate the effect of the bias correction terms explained in sections 2 and 3.
As discussed in section 5, removal of the bias correction terms results in a version of RMSProp (Tieleman
& Hinton, 2012) with momentum. We vary $\beta_1$ and $\beta_2$ when training a variational auto-encoder
(VAE) with the same architecture as in (Kingma & Welling, 2013) with a single hidden
layer with 500 hidden units with softplus nonlinearities and a 50-dimensional spherical Gaussian
latent variable. We iterated over a broad range of hyper-parameter choices, i.e. $\beta_1 \in [0, 0.9]$ and
$\beta_2 \in [0.99, 0.999, 0.9999]$, and $\log_{10}(\alpha) \in [-5, \ldots, -1]$. Values of $\beta_2$ close to 1, required for robustness
to sparse gradients, result in larger initialization bias; therefore we expect the bias correction
term to be important in such cases of slow decay, preventing an adverse effect on optimization.

In Figure 4, values of $\beta_2$ close to 1 indeed lead to instabilities in training when no bias correction term
was present, especially in the first few epochs of the training. The best results were achieved with small
values of $(1 - \beta_2)$ and bias correction; this was more apparent towards the end of optimization when
gradients tend to become sparser as hidden units specialize to specific patterns. In summary, Adam
performed equal to or better than RMSProp, regardless of hyper-parameter setting.


7 Extensions 


7.1 AdaMax 

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a
(scaled) $L^2$ norm of their individual current and past gradients. We can generalize the $L^2$ norm based
update rule to a $L^p$ norm based update rule. Such variants become numerically unstable for large
$p$. However, in the special case where we let $p \to \infty$, a surprisingly simple and stable algorithm
emerges; see algorithm 2. We'll now derive the algorithm. Let, in case of the $L^p$ norm, the stepsize
at time $t$ be inversely proportional to $v_t^{1/p}$, where:

$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p \tag{6}$$

$$= (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p \tag{7}$$
Algorithm 2: AdaMax, a variant of Adam based on the infinity norm. See section 7.1 for details.
Good default settings for the tested machine learning problems are $\alpha = 0.002$, $\beta_1 = 0.9$ and
$\beta_2 = 0.999$. With $\beta_1^t$ we denote $\beta_1$ to the power $t$. Here, $(\alpha / (1 - \beta_1^t))$ is the learning rate with the
bias-correction term for the first moment. All operations on vectors are element-wise.

Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector
  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $u_0 \leftarrow 0$ (Initialize the exponentially weighted infinity norm)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $u_t \leftarrow \max(\beta_2 \cdot u_{t-1}, |g_t|)$ (Update the exponentially weighted infinity norm)
    $\theta_t \leftarrow \theta_{t-1} - (\alpha / (1 - \beta_1^t)) \cdot m_t / u_t$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)


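Note that the decay term is here equivalently parameterised as $\beta_2^p$ instead of $\beta_2$. Now let $p \to \infty$,
and define $u_t = \lim_{p \to \infty} (v_t)^{1/p}$, then:

$$u_t = \lim_{p \to \infty} (v_t)^{1/p} = \lim_{p \to \infty} \left((1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p\right)^{1/p} \tag{8}$$

$$= \lim_{p \to \infty} (1 - \beta_2^p)^{1/p} \left(\sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p\right)^{1/p} \tag{9}$$

$$= \lim_{p \to \infty} \left(\sum_{i=1}^{t} \left(\beta_2^{(t-i)} \cdot |g_i|\right)^p\right)^{1/p} \tag{10}$$

$$= \max\left(\beta_2^{t-1} |g_1|, \beta_2^{t-2} |g_2|, \ldots, \beta_2 |g_{t-1}|, |g_t|\right) \tag{11}$$

Which corresponds to the remarkably simple recursive formula:

$$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|) \tag{12}$$

with initial value $u_0 = 0$. Note that, conveniently enough, we don't need to correct for initialization
bias in this case. Also note that the magnitude of parameter updates has a simpler bound with
AdaMax than Adam, namely: $|\Delta_t| \le \alpha$.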
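Algorithm 2 and the recursion in eq. (12) can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `adamax`, the `grad_fn` callback and the fixed `num_steps` budget are assumptions. Since Algorithm 2 has no $\epsilon$ term, the sketch also assumes every coordinate receives a nonzero gradient at the first step, so that $u_t > 0$ before it is used as a divisor.

```python
import numpy as np

def adamax(grad_fn, theta0, alpha=0.002, beta1=0.9, beta2=0.999, num_steps=1000):
    """Run `num_steps` AdaMax updates; grad_fn(theta) returns a (stochastic) gradient.

    Hypothetical helper: assumes the first gradient is nonzero in every
    coordinate, because the update divides by u without an epsilon.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # biased first moment estimate
    u = np.zeros_like(theta)  # exponentially weighted infinity norm
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        u = np.maximum(beta2 * u, np.abs(g))  # eq. (12), elementwise
        theta = theta - (alpha / (1 - beta1 ** t)) * m / u
    return theta

if __name__ == "__main__":
    # Toy deterministic objective 0.5 * ||theta - target||^2 (an assumption).
    target = np.array([1.0, -2.0])
    print(adamax(lambda th: th - target, np.zeros(2), num_steps=3000))
```

Only the first moment needs a bias correction here; $u_t$ is a running maximum rather than an average, so it is not shrunk towards zero by the zero initialization.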


7.2 Temporal averaging 

Since the last iterate is noisy due to stochastic approximation, better generalization performance is
often achieved by averaging. Previously in Moulines & Bach (2011), Polyak-Ruppert averaging
(Polyak & Juditsky, 1992; Ruppert, 1988) has been shown to improve the convergence of standard
SGD, where $\bar{\theta}_t = \frac{1}{t} \sum_{k=1}^{t} \theta_k$. Alternatively, an exponential moving average over the parameters can
be used, giving higher weight to more recent parameter values. This can be trivially implemented
by adding one line to the inner loop of algorithms 1 and 2: $\bar{\theta}_t \leftarrow \beta_2 \cdot \bar{\theta}_{t-1} + (1 - \beta_2) \theta_t$, with $\bar{\theta}_0 = 0$.
Initialization bias can again be corrected by the estimator $\hat{\theta}_t = \bar{\theta}_t / (1 - \beta_2^t)$.
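The exponential moving average of the parameters, including its bias correction, can be sketched as below. The function name `averaged_iterates` is hypothetical, and the sketch takes a precomputed list of iterates rather than sitting inside an optimizer loop, purely to keep the example self-contained.

```python
import numpy as np

def averaged_iterates(thetas, beta2=0.999):
    """Return bias-corrected exponential moving averages of a sequence of parameter vectors."""
    theta_bar = np.zeros_like(np.asarray(thetas[0], dtype=float))  # initialized at zero
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        theta_bar = beta2 * theta_bar + (1 - beta2) * np.asarray(theta, dtype=float)
        corrected.append(theta_bar / (1 - beta2 ** t))  # undo the zero-initialization bias
    return np.array(corrected)

if __name__ == "__main__":
    # A constant sequence of iterates should be recovered at every timestep.
    thetas = [np.array([3.0, -1.0])] * 5
    print(averaged_iterates(thetas))
```

Without the division by $(1 - \beta_2^t)$, the average after the first step would be only $(1 - \beta_2)\theta_1$, i.e. heavily biased towards the zero initialization.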


8 Conclusion 

We have introduced a simple and computationally efficient algorithm for gradient-based optimization
of stochastic objective functions. Our method is aimed towards machine learning problems with
large datasets and/or high-dimensional parameter spaces. The method combines the advantages of
two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients,
and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward
to implement and requires little memory. The experiments confirm the analysis on the rate of convergence
in convex problems. Overall, we found Adam to be robust and well-suited to a wide range
of non-convex optimization problems in the field of machine learning.
10 Appendix


10.1 Convergence Proof 

Definition 10.1. A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^d$, for all $\lambda \in [0, 1]$,

$$\lambda f(x) + (1 - \lambda) f(y) \ge f(\lambda x + (1 - \lambda) y)$$

Also, notice that a convex function can be lower bounded by a hyperplane at its tangent.
Lemma 10.2. If a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex, then for all $x, y \in \mathbb{R}^d$,

$$f(y) \ge f(x) + \nabla f(x)^T (y - x)$$


The above lemma can be used to upper bound the regret and our proof for the main theorem is 
constructed by substituting the hyperplane with the Adam update rules. 


The following two lemmas are used to support our main theorem. We also use some definitions to
simplify our notation, where $g_t \triangleq \nabla f_t(\theta_t)$ and $g_{t,i}$ is the $i$th element. We define $g_{1:t,i} \in \mathbb{R}^t$ as a vector
that contains the $i$th dimension of the gradients over all iterations till $t$, $g_{1:t,i} = [g_{1,i}, g_{2,i}, \cdots, g_{t,i}]$.

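Lemma 10.3. Let $g_t = \nabla f_t(\theta_t)$ and $g_{1:t}$ be defined as above and bounded, $\|g_t\|_2 \le G$, $\|g_t\|_\infty \le G_\infty$. Then,

$$\sum_{t=1}^{T} \sqrt{\frac{g_{t,i}^2}{t}} \le 2 G_\infty \|g_{1:T,i}\|_2$$

Proof. We will prove the inequality using induction over $T$.

The base case for $T = 1$, we have $\sqrt{g_{1,i}^2} \le 2 G_\infty \|g_{1,i}\|_2$.

For the inductive step,

$$\sum_{t=1}^{T} \sqrt{\frac{g_{t,i}^2}{t}} = \sum_{t=1}^{T-1} \sqrt{\frac{g_{t,i}^2}{t}} + \sqrt{\frac{g_{T,i}^2}{T}} \le 2 G_\infty \|g_{1:T-1,i}\|_2 + \sqrt{\frac{g_{T,i}^2}{T}} = 2 G_\infty \sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}}$$

From $\|g_{1:T,i}\|_2^2 - g_{T,i}^2 + \frac{g_{T,i}^4}{4 \|g_{1:T,i}\|_2^2} \ge \|g_{1:T,i}\|_2^2 - g_{T,i}^2$, we can take the square root of both sides and
have,

$$\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} \le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2 \|g_{1:T,i}\|_2} \le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2 \sqrt{T G_\infty^2}}$$

Rearrange the inequality and substitute the $\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2}$ term,

$$G_\infty \sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}} \le 2 G_\infty \|g_{1:T,i}\|_2$$

□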
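
Lemma 10.4. Let $\gamma \triangleq \frac{\beta_1^2}{\sqrt{\beta_2}}$. For $\beta_1, \beta_2 \in [0, 1)$ that satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$ and bounded $g_t$, $\|g_t\|_2 \le G$,
$\|g_t\|_\infty \le G_\infty$, the following inequality holds

$$\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} \le \frac{2}{1 - \gamma} \frac{1}{\sqrt{1 - \beta_2}} \|g_{1:T,i}\|_2$$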


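Proof. Under the assumption, $\frac{\sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)^2} \le \frac{1}{(1 - \beta_1)^2}$. We can expand the last term in the summation
using the update rules in Algorithm 1,

$$
\begin{aligned}
\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}}
&= \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1 - \beta_2^T}}{(1 - \beta_1^T)^2} \frac{\left(\sum_{k=1}^{T} (1 - \beta_1) \beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T \sum_{j=1}^{T} (1 - \beta_2) \beta_2^{T-j} g_{j,i}^2}} \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1 - \beta_2^T}}{(1 - \beta_1^T)^2} \sum_{k=1}^{T} \frac{T \left((1 - \beta_1) \beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T \sum_{j=1}^{T} (1 - \beta_2) \beta_2^{T-j} g_{j,i}^2}} \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1 - \beta_2^T}}{(1 - \beta_1^T)^2} \sum_{k=1}^{T} \frac{T \left((1 - \beta_1) \beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T (1 - \beta_2) \beta_2^{T-k} g_{k,i}^2}} \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1 - \beta_2^T}}{(1 - \beta_1^T)^2} \frac{(1 - \beta_1)^2}{\sqrt{T (1 - \beta_2)}} \sum_{k=1}^{T} T \left(\frac{\beta_1^2}{\sqrt{\beta_2}}\right)^{T-k} \|g_{k,i}\|_2 \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{T}{\sqrt{T (1 - \beta_2)}} \sum_{k=1}^{T} \gamma^{T-k} \|g_{k,i}\|_2
\end{aligned}
$$

Similarly, we can upper bound the rest of the terms in the summation.

$$
\begin{aligned}
\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}}
&\le \sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t (1 - \beta_2)}} \sum_{j=0}^{T-t} t \gamma^j \\
&\le \sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t (1 - \beta_2)}} \sum_{j=0}^{T} t \gamma^j
\end{aligned}
$$

For $\gamma < 1$, using the upper bound on the arithmetic-geometric series, $\sum_t t \gamma^t < \frac{1}{(1 - \gamma)^2}$:

$$\sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t (1 - \beta_2)}} \sum_{j=0}^{T} t \gamma^j \le \frac{1}{(1 - \gamma)^2 \sqrt{1 - \beta_2}} \sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t}}$$

Apply Lemma 10.3,

$$\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} \le \frac{2 G_\infty}{(1 - \gamma)^2 \sqrt{1 - \beta_2}} \|g_{1:T,i}\|_2$$

□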


To simplify the notation, we define $\gamma \triangleq \frac{\beta_1^2}{\sqrt{\beta_2}}$. Intuitively, our following theorem holds when the
learning rate $\alpha_t$ is decaying at a rate of $t^{-1/2}$ and the first moment running average coefficient $\beta_{1,t}$ decays
exponentially with $\lambda$, which is typically close to 1, e.g. $1 - 10^{-8}$.
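
The quantities in this paragraph can be sketched in a few lines of Python. The snippet assumes the concrete schedules $\alpha_t = \alpha / \sqrt{t}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$ used in the theorem; the function names are hypothetical.

```python
import math

def gamma(beta1, beta2):
    # gamma = beta1^2 / sqrt(beta2); the analysis requires gamma < 1
    return beta1 ** 2 / math.sqrt(beta2)

def alpha_t(alpha, t):
    # Step size decaying at rate t^(-1/2), for t = 1, 2, ...
    return alpha / math.sqrt(t)

def beta1_t(beta1, lam, t):
    # First-moment coefficient decaying exponentially with lam, for t = 1, 2, ...
    return beta1 * lam ** (t - 1)
```

With $\beta_1 = 0.9$ and $\beta_2 = 0.999$, `gamma` returns a value slightly above 0.81, so the condition $\gamma < 1$ is met for these settings.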

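Theorem 10.5. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$
for all $\theta \in \mathbb{R}^d$, and that the distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$,
$\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \ldots, T\}$, and $\beta_1, \beta_2 \in [0, 1)$ satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$. Let $\alpha_t = \frac{\alpha}{\sqrt{t}}$
and $\beta_{1,t} = \beta_1 \lambda^{t-1}$, $\lambda \in (0, 1)$. Adam achieves the following guarantee, for all $T \ge 1$.

$$R(T) \le \frac{D^2}{2 \alpha (1 - \beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha (\beta_1 + 1) G_\infty}{(1 - \beta_1) \sqrt{1 - \beta_2} (1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2 \alpha (1 - \beta_1) (1 - \lambda)^2}$$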


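Proof. Using Lemma 10.2, we have,

$$f_t(\theta_t) - f_t(\theta^*) \le g_t^T (\theta_t - \theta^*) = \sum_{i=1}^{d} g_{t,i} (\theta_{t,i} - \theta^*_{,i})$$

From the update rules presented in Algorithm 1,

$$
\begin{aligned}
\theta_{t+1} &= \theta_t - \alpha_t \hat{m}_t / \sqrt{\hat{v}_t} \\
&= \theta_t - \frac{\alpha_t}{1 - \beta_1^t} \left( \frac{\beta_{1,t}}{\sqrt{\hat{v}_t}} m_{t-1} + \frac{(1 - \beta_{1,t})}{\sqrt{\hat{v}_t}} g_t \right)
\end{aligned}
$$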
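
The update analysed above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: it uses the schedules $\alpha_t = \alpha / \sqrt{t}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$ from Theorem 10.5, and it adds a small `eps` to the denominator for numerical safety, which the analysis omits. The function name `adam_minimize` and its signature are hypothetical.

```python
import math

def adam_minimize(grad, theta0, steps, alpha=0.001, beta1=0.9, beta2=0.999,
                  lam=1.0 - 1e-8, eps=1e-8):
    """Run `steps` Adam updates with the decaying schedules assumed by the analysis.

    `grad` maps a list of parameters to a list of partial derivatives.
    """
    theta = list(theta0)
    m = [0.0] * len(theta)  # first moment estimate, m_0 = 0
    v = [0.0] * len(theta)  # second raw moment estimate, v_0 = 0
    for t in range(1, steps + 1):
        g = grad(theta)
        a_t = alpha / math.sqrt(t)        # alpha_t = alpha / sqrt(t)
        b1_t = beta1 * lam ** (t - 1)     # beta_{1,t} = beta1 * lam^(t-1)
        for i in range(len(theta)):
            m[i] = b1_t * m[i] + (1.0 - b1_t) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)   # bias-corrected first moment
            v_hat = v[i] / (1.0 - beta2 ** t)   # bias-corrected second moment
            # theta_{t+1} = theta_t - alpha_t * m_hat / sqrt(v_hat); eps is not in the proof
            theta[i] -= a_t * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

On the first step the bias corrections give $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each coordinate moves by approximately $\alpha$ in the direction opposite to the sign of its gradient.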

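We focus on the $i$-th dimension of the parameter vector $\theta_t \in \mathbb{R}^d$. Subtracting the scalar $\theta^*_{,i}$ and squaring
both sides of the above update rule, we have,

$$(\theta_{t+1,i} - \theta^*_{,i})^2 = (\theta_{t,i} - \theta^*_{,i})^2 - \frac{2 \alpha_t}{1 - \beta_1^t} \left( \frac{\beta_{1,t}}{\sqrt{\hat{v}_{t,i}}} m_{t-1,i} + \frac{(1 - \beta_{1,t})}{\sqrt{\hat{v}_{t,i}}} g_{t,i} \right) (\theta_{t,i} - \theta^*_{,i}) + \alpha_t^2 \left( \frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}} \right)^2$$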

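We can rearrange the above equation and use Young's inequality, $ab \le a^2/2 + b^2/2$. Also, it can be
shown that $\sqrt{\hat{v}_{t,i}} = \sqrt{\sum_{j=1}^{t} (1 - \beta_2) \beta_2^{t-j} g_{j,i}^2} / \sqrt{1 - \beta_2^t} \le \|g_{1:t,i}\|_2$ and $\beta_{1,t} \le \beta_1$. Then

$$
\begin{aligned}
g_{t,i} (\theta_{t,i} - \theta^*_{,i})
&= \frac{(1 - \beta_1^t) \sqrt{\hat{v}_{t,i}}}{2 \alpha_t (1 - \beta_{1,t})} \left( (\theta_{t,i} - \theta^*_{,i})^2 - (\theta_{t+1,i} - \theta^*_{,i})^2 \right) \\
&\quad + \frac{\beta_{1,t}}{(1 - \beta_{1,t})} \frac{\hat{v}_{t-1,i}^{1/4}}{\sqrt{\alpha_{t-1}}} (\theta^*_{,i} - \theta_{t,i}) \sqrt{\alpha_{t-1}} \frac{m_{t-1,i}}{\hat{v}_{t-1,i}^{1/4}} + \frac{\alpha_t (1 - \beta_1^t) \sqrt{\hat{v}_{t,i}}}{2 (1 - \beta_{1,t})} \left( \frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}} \right)^2 \\
&\le \frac{1}{2 \alpha_t (1 - \beta_1)} \left( (\theta_{t,i} - \theta^*_{,i})^2 - (\theta_{t+1,i} - \theta^*_{,i})^2 \right) \sqrt{\hat{v}_{t,i}} + \frac{\beta_{1,t}}{2 \alpha_{t-1} (1 - \beta_{1,t})} (\theta^*_{,i} - \theta_{t,i})^2 \sqrt{\hat{v}_{t-1,i}} \\
&\quad + \frac{\beta_1 \alpha_{t-1}}{2 (1 - \beta_1)} \frac{m_{t-1,i}^2}{\sqrt{\hat{v}_{t-1,i}}} + \frac{\alpha_t}{2 (1 - \beta_1)} \frac{\hat{m}_{t,i}^2}{\sqrt{\hat{v}_{t,i}}}
\end{aligned}
$$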
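
The two auxiliary facts used in this step, Young's inequality and the bound $\sqrt{\hat{v}_{t,i}} \le \|g_{1:t,i}\|_2$, can be checked numerically with the Python sketch below. It works on a single coordinate and assumes the second-moment recursion $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ with $v_0 = 0$ from Algorithm 1; all function names are hypothetical.

```python
import math

def v_hat_recursive(g, beta2):
    # Bias-corrected second moment from the recursion v_t = beta2*v_{t-1} + (1-beta2)*g_t^2
    v = 0.0
    for x in g:
        v = beta2 * v + (1.0 - beta2) * x * x
    return v / (1.0 - beta2 ** len(g))

def v_hat_closed_form(g, beta2):
    # Same quantity as a weighted sum: sum_j (1-beta2)*beta2^(t-j)*g_j^2 / (1 - beta2^t)
    t = len(g)
    total = sum((1.0 - beta2) * beta2 ** (t - j) * g[j - 1] ** 2 for j in range(1, t + 1))
    return total / (1.0 - beta2 ** t)

def l2_norm(g):
    # ||g_{1:t,i}||_2 for one coordinate's gradient history
    return math.sqrt(sum(x * x for x in g))

def young_gap(a, b):
    # a^2/2 + b^2/2 - a*b, which equals (a - b)^2 / 2 and is therefore non-negative
    return a * a / 2.0 + b * b / 2.0 - a * b
```

The bound on $\sqrt{\hat{v}_{t,i}}$ holds because the weights $(1 - \beta_2) \beta_2^{t-j} / (1 - \beta_2^t)$ sum to one, so $\hat{v}_{t,i}$ is a weighted average of the squared gradients and cannot exceed their sum.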

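We apply Lemma 10.4 to the above inequality and derive the regret bound by summing across all
the dimensions for $i \in 1, \ldots, d$ in the upper bound of $f_t(\theta_t) - f_t(\theta^*)$ and the sequence of convex
functions for $t \in 1, \ldots, T$:

$$
\begin{aligned}
R(T) &\le \sum_{i=1}^{d} \frac{1}{2 \alpha_1 (1 - \beta_1)} (\theta_{1,i} - \theta^*_{,i})^2 \sqrt{\hat{v}_{1,i}} + \sum_{i=1}^{d} \sum_{t=2}^{T} \frac{1}{2 (1 - \beta_1)} (\theta_{t,i} - \theta^*_{,i})^2 \left( \frac{\sqrt{\hat{v}_{t,i}}}{\alpha_t} - \frac{\sqrt{\hat{v}_{t-1,i}}}{\alpha_{t-1}} \right) \\
&\quad + \frac{\beta_1 \alpha G_\infty}{(1 - \beta_1) \sqrt{1 - \beta_2} (1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \frac{\alpha G_\infty}{(1 - \beta_1) \sqrt{1 - \beta_2} (1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 \\
&\quad + \sum_{i=1}^{d} \sum_{t=1}^{T} \frac{\beta_{1,t}}{2 \alpha_t (1 - \beta_{1,t})} (\theta^*_{,i} - \theta_{t,i})^2 \sqrt{\hat{v}_{t,i}}
\end{aligned}
$$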
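
From the assumption, $\|\theta_t - \theta^*\|_2 \le D$, $\|\theta_m - \theta_n\|_\infty \le D_\infty$, we have:

$$
\begin{aligned}
R(T) &\le \frac{D^2}{2 \alpha (1 - \beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha (1 + \beta_1) G_\infty}{(1 - \beta_1) \sqrt{1 - \beta_2} (1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \frac{D_\infty^2}{2 \alpha} \sum_{i=1}^{d} \sum_{t=1}^{T} \frac{\beta_{1,t}}{(1 - \beta_{1,t})} \sqrt{t \hat{v}_{t,i}} \\
&\le \frac{D^2}{2 \alpha (1 - \beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha (1 + \beta_1) G_\infty}{(1 - \beta_1) \sqrt{1 - \beta_2} (1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2 \alpha} \sum_{i=1}^{d} \sum_{t=1}^{T} \frac{\beta_{1,t}}{(1 - \beta_{1,t})} \sqrt{t}
\end{aligned}
$$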


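We can use the arithmetic-geometric series upper bound for the last term:

$$
\begin{aligned}
\sum_{t=1}^{T} \frac{\beta_{1,t}}{(1 - \beta_{1,t})} \sqrt{t}
&\le \sum_{t=1}^{T} \frac{1}{(1 - \beta_1)} \lambda^{t-1} \sqrt{t} \\
&\le \sum_{t=1}^{T} \frac{1}{(1 - \beta_1)} \lambda^{t-1} t \\
&\le \frac{1}{(1 - \beta_1)(1 - \lambda)^2}
\end{aligned}
$$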
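
The series bound can be checked numerically. The Python sketch below compares partial sums of $\sum_t t \lambda^{t-1}$ with the limit $1 / (1 - \lambda)^2$ and also evaluates the left-hand side of the chain under the schedule $\beta_{1,t} = \beta_1 \lambda^{t-1}$. The function names are hypothetical, and the check only illustrates the inequality for chosen values.

```python
import math

def arithmetic_geometric_partial_sum(lam, T):
    # sum_{t=1}^{T} t * lam^(t-1)
    return sum(t * lam ** (t - 1) for t in range(1, T + 1))

def series_limit(lam):
    # Value of the infinite series sum_{t>=1} t * lam^(t-1) for 0 <= lam < 1
    return 1.0 / (1.0 - lam) ** 2

def last_term_sum(beta1, lam, T):
    # sum_{t=1}^{T} beta_{1,t} / (1 - beta_{1,t}) * sqrt(t), with beta_{1,t} = beta1 * lam^(t-1)
    total = 0.0
    for t in range(1, T + 1):
        b = beta1 * lam ** (t - 1)
        total += b / (1.0 - b) * math.sqrt(t)
    return total
```

Because every term of the series is non-negative, the partial sums increase with $T$ and stay below the limit, which is why the bound holds uniformly in $T$.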


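Therefore, we have the following regret bound:

$$R(T) \le \frac{D^2}{2 \alpha (1 - \beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha (1 + \beta_1) G_\infty}{(1 - \beta_1) \sqrt{1 - \beta_2} (1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2 \alpha (1 - \beta_1) (1 - \lambda)^2}$$

□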